Below are 100 randomly selected rows from the dataset.
The table below shows several metrics calculated against the various columns/variables. These metrics include: the number of unique values, number of NAs, the maximum value, the minimum value, and the mean/average.
A quick look at counts of the various levels of the categorical variables found in the data is shown in the following set of bar charts. Trelliscopejs is an R package that enables one to quickly sort and/or filter through various slices of their previously generated visualizations/plots of their data. In one case we may want to look through all levels of the variable, dna_visittrafficsubtype (I don’t show that here). To do this we simply create a generic plot, apply only one slice/level/subset (subtype) at a time to the plot, and create any features that might help us learn about the effects of dna_visittrafficsubtype on the data. We can then sort and filter through the feature set to find anomolies or interesting behaviors that might influence modeling or help us understand the relationship between variables in the data.
For this dataset the number of categorical variables is not overwhelming so the power of trelliscope may appear limited. However, when one needs to quickly move through 100s, 1000s or even millions of slices of the data, trelliscope provides a seemless interface to manage so many subsets of data and better discover interesting features.
The only numeric variable, xsell_day_exact, is highly skewed so a log transformation better shows the distribution of values.
As noted in the section below, outliers and NAs make it difficult to initially see the trends over time of these two time series related variables.
33,407 prospectids appear multiple times, accounting for 85,349 rows in the dataset.33.8% of the values for that variable. May want to look at a cummulative view of some kind.A closer look at dna_visittrafficsubtype shows that many of the subtypes are rarely found in this dataset. Grouping or combining these in a meaningful manner may help, but unfortunately I doubt I have sufficient information or experience to group the levels of this variable at this time.
I must not understand the regtenure column yet. I would have assumed that regtenure was just a categorical view of xsell_day_exact. From the histograms below I must be missing something and those two variables are not linked in any way.
After removing the outlier dates (noted above) for ordercreatedate we can better see the general trend. Though really to see that trend one would need to zoom into the main group of data (not include the high points from last year’s sale around Thanksgiving). In fact, after zooming in to see that general positive sloped trend one notices several days of observations that are very low and warrant further consideration (March 21-30, 2018).
After removing the NAs from dnatestactivationdayid we can better see the general trend.
Variance appears to tighten up in 2016-2017 and the obvious drop in late 2016 to 2017 will cause problems for most models. Forecasting or predicting could prove difficult if the model isn’t able to account for the sudden drop. The drop is also temporary, meaning it will always exist at the end of the time series because that cross-sell conversion requires time to happen. Recent orders purchased (say in the last week, or end of the time series) will have a very low percent of cross-sells because most of the cross-sells require more time to occur. A few weeks/months later that cross-sell percentage will have increased dramatically, because sufficient time has occurred to allow the customer time to buy the cross-sell product. In short, modeling the cross-sell percentage will always have this limitation (recent drop) and one would have to overcome that by weighting past observations higher than the most recent. It would be somewhat difficult (definitely tricky).
A more detailed view of this daily xsell conversion may help us understand what is influencing this behavior and how that might affect model construction. Given the statements above it wouldn’t be wise to pursue this further, so I only show it broken out by the customer type group.